Proof-of-concept implementation of the massively parallel algorithm 
for simulation of dispersion-managed WDM optical fiber systems 
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We perform a proof-of-concept implementation of the massively parallel algorithm (P.M. Lushnikov, Opt. 
Lett., V. 27, 939 (2002)) for simulation of dispersion-managed wavelength-division-multiplexed optical fiber 
systems. Linear scalability of the algorithm with the number of computer cores is demonstrated. Exact result 
on the accuracy of the implemented algorithm is found analytically and confirmed numerically as well as it is 
compared with the accuracy of the standard split-step algorithm. © 2011 Optical Society of America 
OCIS codes: (060.2330) (190.4370) (190.5530) 
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A wavelength-division-multiplexed (WDM) 
dispersion-managed (DM) optical fiber system is 
the basis of current high-bit-rate optical communica- 
tions. Next generation of these systems will use both the 
amplitude and phase of the optical signal as a carrier 
of information (see e.g. [I1I2]) to achieve higher system 
performance. WDM systems are weakly nonlinear ones 
with a linear dispersion length typically in rage of 
tens while a nonlinear length is at several hundreds of 
kilometers [3]-[6]. Nonlinearity is a major factor limiting 
performance of such systems while linear effects can be 
significantly compensated by coherent detection. 

WDM requires propagation of a wide range of fre- 
qunces through optical fiber coupled by the nonlinear- 
ity. Path-averaged group-velocity dispersion (GVD) ef- 
fects cause optical pulses in distinct WDM channels to 
move with different group velocities. Consequently mod- 
eling of WDM systems requires simulating a long time 
interval, which determines needed high resolution in the 
frequencies. All that makes accurate numerical simula- 
tions enormously challenging with very large number of 
Fourier modes N needed to be resolved. The standard 
algorithm for such simulation is an operator splitting, 
or split-step algorithm (SS). It involves several Fourier 
transforms for every spatial step along optical fiber. The 
fast Fourier transform (FFT) algorithm computes each 
such transformation in 0{N log N) operations of multi- 
plication. The efficiency of using of supercomputers for 
SS is limited because parallel algorithms for one dimen- 
sional FFT (contrary to multidimensional FFT) provide 
only very moderate speed up. For example, one of the 
leading implementations [7] shows at best four times 
acceleration on 16 processor cores on the system with 
shared memory. Further increase of CPU number ap- 
pears to be inefficient. Further increase of number of 
processors on conventional systems requires use of a dis- 
tributed memory approach (cluster), which has higher 
latency of nodes interconnection media. The experimen- 
tal data shows [7] that at number of harmonics up to 



2^^, a shared memory approach is more efficient. At a 
higher number of harmonics, moderate acceleration can 
be achieved on a cluster, although scaling would still be 
far from linear. 

Here we demonstrate the proof-of-concept realization 
of the massively parallel algorithm (MPA) for simula- 
tion of WDM systems that is free from all these limita- 
tions. MPA was proposed by one of the authors of this 
Letter in [8] and exploits weak nonlinearity of WDM 
system. We demonstrate the linear scalability of perfor- 
mance with number of computer cores. We also obtain 
the exact result on the accuracy of the algorithm in com- 
parison with SS. The results are in full agreement with 
numerics. 

We neglect polarization effects, stimulated Raman 
scattering, and Brillouin scattering. Then the propaga- 
tion of WDM optical pulses in DM fiber is described by 
a scalar nonlinear Schrodinger equation 

iA, - \p2{z)Au - \P3{z)Aut + ^{z)\A\^A = iG{z)A, (1) 

where G(z) = {-7 + [exp (zaj) - l]J^^^j^S{z - Zk)}] z is 
the propagation distance along an optical fiber; A{t, z) 
is the slow amplitude of light; f32 and (5^ are the first- 
and second-order GVD, respectively, which are periodic 
functions of z; cr = (277712)/ (Ao^e//) is the nonlinear 
coefficient; 712 is the nonlinear refractive index; Aq is the 
carrier wavelength; Ai,f f is the effective fiber area; zj, = 
kZa {k = 1,2, . . . , N) are amplifier locations; and 7 is 
the loss coefficient. Distributed amplification can be also 
included in G{z). 

Applying Fourier transform A{uj,z) = F[A{t,z)\ = 
^°°^A{t,z)exp[\ujt]dt to Eq. ([T]), using the change of 

variables A{uj, z) = ^{uj, z) exp[i/3(cj, z) + J^^ G{z')dz'] 
and integration over z result in the integral equation 

^(w,z) =iA(w,^o)+i / <j{z')F[\A{t,z')\^A{t,z')] 
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-if3iu,z')-f: G(z")dz'' , 



dz', (2) 
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where /3(c.,z) = /^^ [^/32(^') + ^P:,{z')\Az' . 

Case V'l'^i -2^) = const corresponds to the exact solu- 
tion of the hnear part of Eq. ([T|) [or, equivalently, setting 
F[-] = in ©]. Assume that the nonhnearity is weak, 
Zni 3> Zdisp, where Zni = is a characteristic non- 

hnear length, Zdisp = t'^ /\P2\ is the dispersion length, 
and p and r are typical pulse amplitude and width, re- 
spectively. Then z) is a slow function of z on any 
scale L < z„i (see gUSKin]). We solve Eq. ^ by iter- 
ations for < z — zq < X, where we have a freedom of 
choice of L with the only condition that L < Zdisp- For 
the first iteration we set -^^"^ (w, z) = 4'{uj, zq) = A{uj, zq) 

and, respectively, A{z,uj) = -0*^"' (w, z)e''^^"'^''^'^^o '^^^ '''^ 
on the right hand side (rhs) of Eq. ([T]), which gives the 
first iteration -0'^^^(w,z) for the left-hand side (Ihs) of 
Eq. ([1]). Similarly, substitution of ■0'^""^^ (w, z) in the 
rhs. of Eq. ([T|) gives (oj, z) in the Ihs of Eq. ([T]) for 
n = 1,2, . . .. 

In simulations we use ip{u;, zq) with zo = mi for a 
given m = 0, 1, . . . to perform a total number of itera- 
tions ritot to approximate + as ■!/;("*°') (w, zg -I- 
i). Then we use that approximate value as starter for 
the next spatial interval by setting zq = (m + 1)L and 
proceeding in a similar way. 

Assume that the interval zq < z < zq + L includes 
M + 1 equally spaced points zg, zi, . . . , zm = zq + L. 
The MPA is based on these iterations as follows; 
1. A{lu, Zq) ~ -Zq), copy ip{uj, Zq) in ijj{uj, z) at all z. 



i/3(u;,z) + /; G{z')dz' 



at all z. 



2. Find z) = 'ip{w, z)e 

3. In order to return to i-domain, calculate independent 
Fourier transforms z) ~ [/l(a;, z)] at all z. 

4. Calculate independent Fourier transforms V{lj^ z) = 
F[\A{t,z)\^A{t,z)] at all z. 

5. Numerical integration (summation) by trapezoidal 
rule of the integral in Eq. ([2]) using V^(w, z) from step 
4. Save intermediate results of integration at every z. 

6. For the second, third, etc., iterations go to step 2. 

7. Reconstruct A{u)^ zm) on the far edge of interval. 
The MPA is schematically shown in Fig. [TJ All steps 
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Fig. 1. Schematic representation of MPA. 

of the MPA are computed parallelly. The most time- 
consuming steps are 3 and 4. All FFTs at each z are 
independently performed in CPU cores (vertical bars in 
Fig. [1]). Calculations in steps 2 and 5 are done for ev- 
ery harmonic independently (dashed horizontal line in 
Fig. HI . 



We implement the MPA for shared memory symmet- 
ric multiprocessor (SMP) architecture. The only pow- 
erful SMP computer in exclusive use was the HP Su- 
perDome 64000 supercomputer, equipped with 64 HP 
PA-RISC processors (http : //jscc . ru). Both processors 
and memory bandwidth are outdated and relatively slow. 
However, for the proof-of-concept simulation the main 
criteria is the number of processors in the system. 

Simulations were performed in a setup identical to the 
one used in [B] with pseudorandom sequences of optical 
pulses in five channels of 20 periods of WDM DM sys- 
tem. The main difference between the current MPA im- 
plementation and the original algorithm [8] is in a more 
efficient way to handle summation in step 5, optimiz- 
ing CPUs cache use. We achieve ^ 30 times speed up 
with respect to a single processor version of the code. 
Fig. [2] shows the scalability of performance. Scaling is 
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^■"-^^^^^^ linear performance scaling 
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Fig. 2. Scalability of MPA on HP SuperDome 64000. 

close to linear up to 32 processors. Then memory band- 
width limitation of the available SMP computer makes 
further parallelization less efficient. Another reason for 
that was restriction on the memory usage, which lim- 
ited the number of Fourier harmonics. However, we see 
a clear tendency of the scalability improvement with an 
increase of the number of harmonics because the longer 
time computer spends in FFTs the less important com- 
munications arc. 

To find the accuracy of the MPA we put the exact 
solution of Eq. ^ in the operator form as 



A, 



t(z) =exp[z(£-|-AA)z]A(0), 



(3) 



where C represents all linear terms and M represents 
the nonlinear term in Eq. ([I} and we set zq = 0. Here 
and below for brevity we omit the argument t of function 
^(z,t). We assume that /32(z), (3^{z),a{z), G{z) are con- 
stant functions of z at each interval of length L, although 
a generalization to more general dependence on z is 
straightforward. SS uses the efficiency and high precision 
of the simulations for exp[i£z]^(0) and exp[i7V'z]^(0). 
But C and M do not commute and we need to approxi- 
mate Eq. Q for z = L by the composite M steps of SS 
as 

Ass{L) = exp[i£Az/2]g^^exp[-i£Az/2]A(0), (4) 

where Az = L/M, Q = exp[iA/'Az] exp[i£Az]. Taylor 
series expansion of operators in Eqs. (|4]) and dU gives the 
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following error ri = Aexact{L) — Ass{L) of composite SS 
for arbitrary M assuming L ^ z^^sp *^ ^rii- 

n = i[L^/{2M^)]PL + [L^/M^]0{CN^A^), (5) 

where Aq = A(0), 0{L^M^ Aq) means different com- 
binations of terms with fcth power of C and Ith 
power of TV". Also, Pl = \Ao{CAof - \Ao\CAo\^ + 
i|Ao|2£2Ao-i£(|Ao|20o) + i^A2£2Ao + i£(A2£Ao) + 
j;^£^(|AopAo) represents all terms with the second 
power in C. Note that the operator expansions for ar- 
bitrary M are not trivial and requires the extensive use 
of the symbolic computations. 

Discretization of iterations over z in Eq. ^ with 
ijjj{uj) = il}{zj,uj)^ Zj ~ j^z, j = 0, 1, . . . , M at each 
N discrete values of a; is given by: 

T/(") T/(") 

where ? = 1, 2, . . . , A/, ^/"^ = a{zj)F[\A^;'> \^Af] 
Azexp[-i/3(w,z,) - J^''G{z)dz], A'f^ 

■0^"' exp[i/3(a;, Zj) + J^^ G{z)dz], and ip'l"'\uj) is 
the nth iteration of ip while for zero iteration 
■^j"-* = j = 0, 1, 2, ... , M. From comparison of the 
operator expansion for nth iteration with the operator 
expansion of the exact solution Eq. ([3]) we obtain (again 
assuming that (32{z), (33{z),a{z),G{z) are constant 
functions of z) that the error r2 = Ae^actiL) — ^("^(L) 
of composite SS for arbitrary M : 

= + (^-^'^°) + O (a^"+^^o) , (7) 

where Pl is the same as in Eq. ([5]) and we assume that 
n > 3, which ensures that r2 at leading order 0{L^) 
does not depend on n. For n ~ 2 the additional error 
term is O^L"^ /]VP]C^AfAo), which can be of the same 
order as L^Pl/M^ provided L ~ Zdisp- But for practical 
realization of MPA we expect that L <^ z^isp and then 
n-tot = 2 can be also an optimal choice. 

Error term oc Pl dominates in Eq. ([5]) while in Eq. ([7]) 
it competes with the last term in the rhs which has the 
order 0{N'"+'^Ao) ~ {L/zni)"+^Ao and is independent 
of M because it results from the iterations of Eq. ([2]) in 
the continuous limit M — > cxd. An increase of n ensures 
dominance of oc Pl in Eq. ([7]) because L < Zdisp ^ Zni- 
Then we conclude from comparison of Eqs. ([7|) and ([5]) 
that SS error is twice smaller than the MPA. So to match 
the accuracy of SS it is enough for the MPA to take M 
by a factor 2-^^^ larger. Respectively, the MPA requires a 
minimum 2^/^n + 1 CPU cores to outperform SS (2^/^n 
would be the exact match of performance) . For example, 
in simulations with the parameters of Fig. [5J 250 DM 
periods and n = 2 we obtained the ratio 9.2 of SS and 
MPA computation times at equal accuracy and 32 cores, 
which is close to the theoretical prediction 2~^/^32n~^. 



To check these analytical predictions we simulated a 
three-channel WDM system over one period of DM fiber 
system with 2^^ frequency harmonics, L = 20km for the 
standard fiber, and other parameters as in [8]. As a "nu- 
merically exact" we use SS with 2^^ grid steps over one 
DM system period. Fig. [3] shows that the error of the 
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Fig. 3. Errors in Loo norm (maximum over t) of MPA and 
SS vs. M. Results for 4th and 5th iterations are visually 
indistinguishable with the theoretical scaling line. Inset 
shows the error (normalized to max |'0(i)|) with t for 20- 
channel WDM system after 10"* km. 

MPA with n = 3 scales as one for SS. That is, n = 3 is 
enough to neglect 0{J\f^^'^^ Ao) term in Eq. d?)). Errors 
for MPA and SS are different by a factor 2 in full agree- 
ment with Eqs. ([5]) and We also compare the MPA 
and SS for simulation of the transoceanic distance lO** 
km (250 DM periods) of the realistic WDM system with 
20 channels using N = 2", M = 2", L = 1.25km. That 
system has (20/3)^ higher nonlinearity than above so we 
decreased L. The inset of Fig. [3] shows the error of the 
MPA with n = 3 in that case. A ratio of MPA and SS 
Loo errors (i.e., max over t in that inset and similar for 
SS) is ~ 2.1, again close to 2. 

In conclusion, we have demonstrated the feasibility of 
the MPA. Scaling of the parallel version on the avail- 
able SMP machine was close to linear up to 32 process- 
ing threads even on outdated architecture and with a 
very restricted size of EFT arrays. We propose using a 
shared memory model for parallel computation, which 
has lower penalties due to interprocess communications, 
and exploiting the power of modern graphics processing 
units (CPUs). Nvidia Tesla C2070 GPU has 448 cores 
and 6GB memory, which appears quite suitable for the 
MPA. 
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